+++
title = 'Reinforcement learning'
template = 'page-math.html'
+++
# Reinforcement learning

## What is reinforcement learning?
The agent is in a state and takes an action.
The action is selected by a policy: a function from states to actions.
The environment tells the agent its new state, and provides a reward (a number; higher is better).
The learner adapts the policy to maximise the expectation of future rewards.
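
A minimal sketch of this agent-environment loop, assuming a hypothetical `env` with `reset()`/`step(action)` methods and a `policy` function (neither is defined in the lecture):

```python
def run_episode(env, policy, max_steps=100):
    state = env.reset()                          # agent starts in some state
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # policy: function from states to actions
        state, reward, done = env.step(action)   # environment returns new state and a reward
        total_reward += reward                   # learner maximises expected future reward
        if done:
            break
    return total_reward
```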

Markov decision process: the optimal policy doesn't need to depend on previous states; only the information in the current state counts (Markov property).

![90955f3da8fb0d61c2fa9f3033c65098.png](e78427ef0d0845d0ae21e1c7857d2740.png)

Dealing with sparse loss (sparse rewards):
- start with imitation learning - supervised learning, copying human actions
- reward shaping - guessing a reward for intermediate states, or for states close to good states (see the sketch after this list)
- auxiliary goals - e.g. curiosity, maximum distance traveled
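
A toy illustration of reward shaping, under assumptions of my own (states are positions on a line, the goal is a position, and the shaping bonus is a small distance-based term); none of these names come from the lecture:

```python
def shaped_reward(state, goal, reached_goal):
    true_reward = 1.0 if reached_goal else 0.0   # the real reward is sparse
    distance = abs(goal - state)                 # assumes states are positions on a line
    shaping_bonus = -0.01 * distance             # guessed reward: closer to the goal is better
    return true_reward + shaping_bonus
```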

Policy network: NN that takes the state as input and outputs an action, with a softmax output layer to produce a probability distribution over actions.
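
A minimal sketch of such a policy network in PyTorch (the lecture doesn't prescribe a framework; the state dimension, hidden size, and number of actions are made-up examples):

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2                    # hypothetical sizes

policy_net = nn.Sequential(
    nn.Linear(state_dim, 32),                  # input: state vector
    nn.ReLU(),
    nn.Linear(32, n_actions),                  # one score per action
    nn.Softmax(dim=-1),                        # softmax output layer -> probability distribution
)

state = torch.randn(state_dim)                         # dummy state
action_probs = policy_net(state)                       # distribution over actions
action = torch.multinomial(action_probs, 1).item()     # sample an action from it
```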

Three problems of RL:
- non-differentiable loss
- balancing exploration and exploitation
    - this is a classic trade-off in online learning
    - for example, an agent in a maze may learn to reach a reward of 1 that's close by and keep exploiting that reward, and so it might never explore further and reach the reward of 100 at the end of the maze (a simple way to keep exploring is sketched after this list)
- delayed reward/sparse loss
    - you might take an action that causes a negative result, but the result won't show up until some time later
    - for example, if you start studying before an exam, that's a good thing.
      the issue is that you started one day before, and didn't do jack shit during the preceding two weeks.
    - credit assignment problem: how do you know which action gets the credit (or blame) for the result?
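
One standard way to balance exploration and exploitation is epsilon-greedy action selection; this is my illustration, not a method the lecture commits to, and `probs` is assumed to be the policy's action probabilities as a 1-D numpy array:

```python
import numpy as np

def epsilon_greedy(probs, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.randint(len(probs))   # explore: pick a random action
    return int(np.argmax(probs))               # exploit: pick the currently preferred action
```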

Deterministic policy - every state is followed by the same action.
Probabilistic policy - all actions are possible, but certain actions have a higher probability.
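
As a sketch, the two policy types differ only in how an action is drawn from the policy's output distribution (`probs` is assumed to be a 1-D numpy array of action probabilities, e.g. the softmax output from above):

```python
import numpy as np

def deterministic_action(probs):
    return int(np.argmax(probs))                        # same state -> always the same action

def probabilistic_action(probs):
    return int(np.random.choice(len(probs), p=probs))   # any action possible, weighted by probability
```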

## Approaches
How do you choose the weights (how do you learn)?
Simple backpropagation doesn't work - we don't have labeled examples telling us which move to take for a given state.

### Random search
Pick a random point m in model space.

```python
import numpy as np

def random_search(loss, m, r=0.1, steps=1000):
    # hill-climbing in model space; assumes m is a numpy parameter vector
    for _ in range(steps):
        step = np.random.randn(*m.shape)
        m_new = m + r * step / np.linalg.norm(step)   # random point m' at distance r from m
        if loss(m_new) < loss(m):                     # keep m' only if it lowers the loss
            m = m_new
    return m
```

"Close to" means sampled uniformly among all points at some pre-chosen distance r from m.
### Policy gradient
Follow some semi-random policy, and wait until a reward state is reached; then label all preceding state-action pairs with the final outcome.
I.e. if some actions were bad, they will on average occur more often in sequences ending with a negative reward, and so on average they will be labeled as bad more often.

![442f7f9bc5e14ffbbcfd54f6ea6b72df.png](c484829362004f90be2b33a92acf7fd9.png)

$\nabla \mathbb{E}_a [r(a)] = \nabla \sum_{a} p(a) r(a) = \sum_{a} p(a) r(a) \nabla \ln p(a) = \mathbb{E}_{a} [r(a) \nabla \ln p(a)]$, where $r$ is the ultimate reward at the end of the trajectory (the middle step uses $\nabla p(a) = p(a) \nabla \ln p(a)$).
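
A numerical sanity check of this identity for a toy softmax policy over three actions with a fixed reward per action (all values here are made up, not from the lecture):

```python
import numpy as np

theta = np.array([0.5, -0.2, 0.1])   # policy parameters
r = np.array([1.0, 0.0, 2.0])        # reward for each action

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

p = softmax(theta)                   # p(a)

# Exact gradient of E_a[r(a)] w.r.t. theta for a softmax policy: p_j * (r_j - E[r]).
exact = p * (r - p @ r)

# Sample estimate of E_a[r(a) * grad log p(a)], with grad_theta log p(a) = onehot(a) - p.
samples = np.random.choice(len(p), size=100_000, p=p)
grads = r[samples, None] * (np.eye(len(p))[samples] - p)
estimate = grads.mean(axis=0)

print(exact, estimate)               # the two should be close
```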

### Q-learning
If I need this, I'll make better notes; I can't really understand it from the slides.

## Alpha-stuff
### AlphaGo
Starts with imitation learning.
Improves by playing against previous iterations and against itself; trained by reinforcement learning, using the policy gradient method to update the weights.
During play, it uses Monte Carlo Tree Search, with node values being the probability that black will win from that state.

### AlphaZero
Learns from scratch: there's no imitation learning or reward shaping.
Also applicable to other games, like chess.

Improves on AlphaGo by:
- combining the policy and value nets
- viewing MCTS as a policy improvement operator
- adding residual connections and batch normalization

### AlphaStar
This shit can play StarCraft.

Real time, imperfect information, a large and diverse action space, and no single best strategy.
Its behaviour is generated by a deep NN that takes input from the game interface and outputs instructions that form an action in the game.

It has a transformer torso for the units, a deep LSTM core with an autoregressive policy head, and a pointer network.
It makes use of multi-agent learning.